Operations Runbook: Scaling & Adding Models

This micro-runbook describes the standard sequence for adding another LLM to the running cluster on the Mac Studio (M4 Max / 64GB Unified RAM). Follow the steps in order so that the existing model backends keep running during the change and the hardware memory limits are respected.

CRITICAL OPERATION STEP: Pre-Flight VRAM Evaluation
Because Apple Silicon shares unified memory between the CPU and the GPU, the cluster's safe operating target is 42GB of combined active weights. That leaves a 6GB buffer for context memory growth under the ~48GB Metal limit. Before deploying a new model, calculate its footprint against this budget:

Current Core GGUF (Gemma-4 31B Q4_K_XL): ~18.5GB
Current Core MLX (Qwen-14B 4bit): ~9.5GB
Current Utilized Base: ~28.0GB | Available Scaling Headroom: ~14.0GB

The target model for this expansion is DeepSeek-Coder-6.7B-Instruct (Q4_K_M GGUF), which consumes ~4.0GB. Adding it raises the utilized base to ~32.0GB and leaves ~10.0GB of headroom under the 42GB target.
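The budget arithmetic above can be scripted so that it is rechecked before every expansion. The following is a minimal sketch, assuming a hypothetical helper named headroom_after and the figures quoted in this section; awk is used because plain shell arithmetic does not handle decimals.

```shell
# Hypothetical helper: prints the headroom (GB) left after adding a model.
# Arguments: cap_gb used_gb new_model_gb
# Exits non-zero when the new model would exceed the cap.
headroom_after() {
  awk -v cap="$1" -v used="$2" -v add="$3" 'BEGIN {
    left = cap - used - add
    printf "%.1f\n", left
    exit (left < 0)
  }'
}

# Usage with the figures from this runbook:
# headroom_after 42 28.0 4.0 && echo "fits within the 42GB target"
```

The sizes passed in are the approximate on-disk weights; actual resident memory also depends on context length, which is what the 6GB buffer is reserved for.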

Step-by-Step Deployment Routine

1 Sourcing and Downloading Model Weights

Activate the Python virtual environment on the Mac Studio and use the Hugging Face CLI (hf download) to fetch the validated GGUF weights file:

source ~/local-ai/venv-mlx/bin/activate
hf download Bartowski/deepseek-coder-6.7b-instruct-GGUF deepseek-coder-6.7b-instruct-q4_k_m.gguf --local-dir ~/local-ai/models/gguf
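An interrupted or misrouted download can leave a truncated file or an error page at the target path. As a quick sanity check, GGUF files begin with the four ASCII bytes "GGUF". The sketch below assumes a hypothetical helper named is_gguf; it only inspects the header and does not verify the full file.

```shell
# Hypothetical helper: succeeds only if the file exists and starts with
# the ASCII magic bytes "GGUF".
is_gguf() {
  [ -f "$1" ] || return 1
  magic=$(head -c 4 "$1")
  [ "$magic" = "GGUF" ]
}

# Usage:
# is_gguf ~/local-ai/models/gguf/deepseek-coder-6.7b-instruct-q4_k_m.gguf \
#   && echo "header looks like GGUF"
```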

2 Provisioning a New Port Allocation Slot

Our foundational GGUF process uses port 8080, and our MLX process uses port 8081. We allocate port 8082 to this third engine. Launch it in a new detached tmux session so that it keeps running after you close the terminal:

tmux new-session -d -s engine-coding '~/local-ai/bin/llama-server -m ~/local-ai/models/gguf/deepseek-coder-6.7b-instruct-q4_k_m.gguf --port 8082 --host 127.0.0.1 -c 8192 -np 1'
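Loading the weights takes a moment, so the port may not answer immediately after the session starts. A minimal sketch of a readiness wait follows, assuming a hypothetical generic helper named wait_until and the /health endpoint that llama-server exposes; the attempt count and delay are illustrative.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Arguments: max_attempts delay_seconds command [args...]
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" >/dev/null 2>&1 && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage: poll the new engine for up to ~60 seconds.
# curl -f makes HTTP error responses count as failures.
# wait_until 30 2 curl -sf http://127.0.0.1:8082/health \
#   && echo "engine-coding is ready"
```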

3 Updating Gateway Router Matrix Definitions

Open the gateway configuration file (~/local-ai/configs/litellm_config.yaml) in a console editor such as nano. Append the new model block as the last entry of the existing model_list array, keeping the same indentation as the entries above it:

model_list:
  - model_name: production-deep-context
    litellm_params:
      model: openai/unsloth/gemma-4-31B-it-qat-GGUF
      api_base: http://127.0.0.1:8080/v1
      api_key: "not-needed"

  - model_name: production-ultra-fast
    litellm_params:
      model: openai/mlx-community/Qwen2.5-14B-Instruct-4bit
      api_base: http://127.0.0.1:8081/v1
      api_key: "not-needed"

  # APPEND THE THIRD DEPLOYMENT TARGET PRECISELY HERE:
  - model_name: production-coding-assistant
    litellm_params:
      model: openai/Bartowski/deepseek-coder-6.7b-instruct-GGUF
      api_base: http://127.0.0.1:8082/v1
      api_key: "not-needed"
      tpm: 150000
      rpm: 1500
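Copy-and-paste edits to this file make it easy to leave a stale alias or port in the new block. The sketch below assumes a hypothetical helper named find_duplicates that scans the YAML as plain text; it is not a YAML parser and relies on the one-key-per-line layout shown above. In this setup each model_name and each api_base is expected to appear once.

```shell
# Hypothetical helper: prints any value of the given key that appears more
# than once in the config. Succeeds only when there are no duplicates.
# Arguments: key config_file
find_duplicates() {
  dups=$(grep -E "^[[:space:]-]*$1:" "$2" \
    | sed -E "s/^[[:space:]-]*$1:[[:space:]]*//" \
    | sort | uniq -d)
  [ -z "$dups" ] && return 0
  printf '%s\n' "$dups"
  return 1
}

# Usage:
# find_duplicates model_name ~/local-ai/configs/litellm_config.yaml
# find_duplicates api_base   ~/local-ai/configs/litellm_config.yaml
```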

4 Cycling the Proxy Runtime Cache

Restart the LiteLLM proxy so that it rereads the updated YAML. The backend model sessions keep running throughout; only the gateway on port 4000 is briefly unavailable while it restarts:

# Kill the running proxy session
tmux kill-session -t gateway-proxy

# Relaunch the proxy with the docs and OpenAPI schema endpoints still disabled
tmux new-session -d -s gateway-proxy 'export NO_DOCS=true && export NO_REDOC=true && export NO_OPENAPI=true && source ~/local-ai/venv-mlx/bin/activate && litellm --config ~/local-ai/configs/litellm_config.yaml --port 4000 --host 0.0.0.0'
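If the relaunch command fails (for example, on a YAML syntax error), the detached session exits immediately and nothing is printed to the calling terminal. A minimal sketch of a session check follows, assuming a hypothetical helper named missing_sessions; pass it whichever session names your cluster uses.

```shell
# Hypothetical helper: prints each expected tmux session that is not running.
# Arguments: expected session names. Succeeds only if all are present.
missing_sessions() {
  running=$(tmux list-sessions -F '#{session_name}' 2>/dev/null)
  rc=0
  for name in "$@"; do
    printf '%s\n' "$running" | grep -qxF -- "$name" || {
      echo "$name"
      rc=1
    }
  done
  return "$rc"
}

# Usage:
# missing_sessions engine-coding gateway-proxy || echo "sessions above are down"
```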

5 Verification and End-to-End Smoke Testing

First, execute tmux ls to verify that all background sessions (the three engines plus gateway-proxy) are still running. Then run this validation query from the console to verify that the gateway routes the request to the engine on port 8082:

curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk_live_mac_studio_master_init_key_2026" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "production-coding-assistant",
    "messages": [{"role": "user", "content": "Write a python function for quicksort."}]
  }'
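When the smoke test is scripted, it helps to capture only the HTTP status code and map it to a next action. The sketch below assumes a hypothetical helper named explain_status and a hypothetical PROXY_KEY environment variable holding the bearer key; the hints are rules of thumb, not an exhaustive mapping.

```shell
# Hypothetical helper: turns an HTTP status code into a one-line hint.
explain_status() {
  case "$1" in
    200) echo "ok: the gateway returned a completion" ;;
    401) echo "auth: check the Bearer key sent to the proxy" ;;
    404) echo "not found: check the URL path and the model alias" ;;
    000) echo "no connection: is the proxy listening on port 4000?" ;;
    *)   echo "unexpected status $1: inspect the gateway-proxy session" ;;
  esac
}

# Usage: curl prints 000 as the status when it cannot connect at all.
# code=$(curl -s -o /dev/null -w '%{http_code}' \
#   http://localhost:4000/v1/chat/completions \
#   -H "Authorization: Bearer $PROXY_KEY" \
#   -H "Content-Type: application/json" \
#   -d '{"model": "production-coding-assistant", "messages": [{"role": "user", "content": "ping"}]}')
# explain_status "$code"
```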

Operational Maintenance & Troubleshooting Logs

Checking Engine State: If a model stops responding, execute tmux attach-session -t engine-coding to inspect the backend's console output. Press Ctrl + B then D to detach without stopping the process.
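Routing and Auth Faults (404/401 Errors): With the openai/ prefix, LiteLLM treats the backend as an OpenAI-compatible endpoint and forwards the rest of the model: string to the server at api_base as the model name; it does not resolve the string against Hugging Face. A 401 typically means the Authorization: Bearer key in the request does not match the key the proxy expects. A 404 typically means a wrong path, such as an api_base missing /v1. Also confirm that the model field in the request matches a model_name alias in model_list exactly.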
Port Overlap: Never bind two background services to the same port on the loopback address. When adding a fourth model, continue the sequence and assign it port 8083.
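The port sequence can be derived from the gateway config instead of being tracked by hand. The following is a minimal sketch, assuming a hypothetical helper named next_backend_port that reads the api_base lines as plain text; it reports the next number in the sequence but does not prove that the port is actually free.

```shell
# Hypothetical helper: prints the port after the highest api_base port
# found in the config. Fails if no api_base with a port is present.
next_backend_port() {
  max=$(grep -oE 'api_base:[[:space:]]*https?://[^/]+:[0-9]+' "$1" \
    | grep -oE '[0-9]+$' | sort -n | tail -n 1)
  [ -n "$max" ] || return 1
  echo $((max + 1))
}

# Usage:
# port=$(next_backend_port ~/local-ai/configs/litellm_config.yaml)
# Confirm nothing is already listening there before launching:
# lsof -nP -iTCP:"$port" -sTCP:LISTEN || echo "port $port is free"
```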

Adding and Provisioning New Models